Delphi’s COVIDcast Project:
An Ecosystem for Tracking and Forecasting the Pandemic

Ryan Tibshirani
Statistics and Machine Learning
Carnegie Mellon University


Delphi Group, Carnegie Mellon University

May 19, 2021

Delphi: Then and Now

COVIDcast Indicators

COVIDcast Ecosystem

Outline For This Talk

I can’t cover all of this! I’ll focus on our API and give some basic data demos (reproducible: all code included), then reflect on a few lessons learned.


Outline:

Part 1: API & Data Demos


COVIDcast API

The COVIDcast API is based on HTTP GET queries and returns data in JSON or CSV format

Parameter    Description                            Examples
data_source  data source                            doctor-visits or fb-survey
signal       signal derived from data source        smoothed_cli or smoothed_adj_cli
time_type    temporal resolution of the signal      day or week
geo_type     spatial resolution of the signal       county, hrr, msa, or state
time_values  time units over which events happened  20200406 or 20200406-20200410
geo_value    location codes, depending on geo_type  * for all, or pa for Pennsylvania
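
To make the parameter table concrete, here is a minimal base-R sketch that assembles a query URL from those parameters. The base URL and the `endpoint=covidcast` selector are assumptions (neither appears in this talk), so check the API documentation for the current values before relying on them.

```r
# Sketch: assemble a COVIDcast API query URL from the parameters in the table.
# ASSUMPTION: base_url and the "endpoint" parameter are not given in this talk;
# they are placeholders to be checked against the API documentation.
base_url = "https://api.covidcast.cmu.edu/epidata/api.php"

params = list(endpoint    = "covidcast",
              data_source = "fb-survey",
              signal      = "smoothed_cli",
              time_type   = "day",
              geo_type    = "state",
              time_values = "20200406-20200410",
              geo_value   = "pa")

# Percent-encode each value, then join as key=value pairs separated by "&"
encoded = vapply(params, function(v) URLencode(v, reserved = TRUE),
                 character(1))
query_url = paste0(base_url, "?",
                   paste(names(params), encoded, sep = "=", collapse = "&"))
print(query_url)

# Fetching the result needs network access and a JSON parser, for example:
# res = jsonlite::fromJSON(query_url)
```

In practice the R and Python packages build and send these queries for you, so hand-built URLs are mostly useful for quick checks or for other languages.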

R and Python Packages

We also provide R and Python packages for API access. Highlights:

(Have an idea? File an issue or contribute a PR on our public GitHub repo)

List of Indicators

library(covidcast)
library(dplyr)  # provides %>%, group_by(), summarize(), mutate(), slice()
covidcast_meta() %>%
  group_by(data_source, signal) %>%
  summarize(county = ifelse("county" %in% geo_type, "*", ""),
            msa = ifelse("msa" %in% geo_type, "*", ""),
            hrr = ifelse("hrr" %in% geo_type, "*", ""),
            state = ifelse("state" %in% geo_type, "*", "")) %>%
  mutate(signal = ifelse(nchar(signal) <= 35, signal,
                         paste0(substr(signal, 1, 32), "..."))) %>%
  slice(grep("(raw|7dav|\\_w)", signal, invert = TRUE)) %>%
  as.data.frame() %>%
  print(right = FALSE, row.names = FALSE)
##  data_source           signal                              county msa hrr state
##  chng                  smoothed_adj_outpatient_cli         *      *   *   *    
##  chng                  smoothed_adj_outpatient_covid       *      *   *   *    
##  chng                  smoothed_outpatient_cli             *      *   *   *    
##  chng                  smoothed_outpatient_covid           *      *   *   *    
##  covid-act-now         pcr_specimen_positivity_rate        *      *   *   *    
##  covid-act-now         pcr_specimen_total_tests            *      *   *   *    
##  doctor-visits         smoothed_adj_cli                    *      *   *   *    
##  doctor-visits         smoothed_cli                        *      *   *   *    
##  fb-survey             smoothed_accept_covid_vaccine       *      *   *   *    
##  fb-survey             smoothed_anxious_5d                 *      *   *   *    
##  fb-survey             smoothed_anxious_7d                 *      *   *   *    
##  fb-survey             smoothed_cli                        *      *   *   *    
##  fb-survey             smoothed_covid_vaccinated           *      *   *   *    
##  fb-survey             smoothed_covid_vaccinated_or_accept *      *   *   *    
##  fb-survey             smoothed_depressed_5d               *      *   *   *    
##  fb-survey             smoothed_depressed_7d               *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_dont_sp... *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_had_covid  *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_not_ben... *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_not_hig... *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_not_ser... *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_other      *      *   *   *    
##  fb-survey             smoothed_dontneed_reason_precaut... *      *   *   *    
##  fb-survey             smoothed_felt_isolated_5d           *      *   *   *    
##  fb-survey             smoothed_felt_isolated_7d           *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_allergic  *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_cost      *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_dislik... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_distru... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_distru... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_health... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_ineffe... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_low_pr... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_not_re... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_other     *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_pregnant  *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_religious *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_sideef... *      *   *   *    
##  fb-survey             smoothed_hesitancy_reason_unnece... *      *   *   *    
##  fb-survey             smoothed_hh_cmnty_cli               *      *   *   *    
##  fb-survey             smoothed_ili                        *      *   *   *    
##  fb-survey             smoothed_inperson_school_fulltime   *      *   *   *    
##  fb-survey             smoothed_inperson_school_parttime   *      *   *   *    
##  fb-survey             smoothed_large_event_1d             *      *   *   *    
##  fb-survey             smoothed_large_event_indoors_1d     *      *   *   *    
##  fb-survey             smoothed_nohh_cmnty_cli             *      *   *   *    
##  fb-survey             smoothed_others_masked              *      *   *   *    
##  fb-survey             smoothed_public_transit_1d          *      *   *   *    
##  fb-survey             smoothed_received_2_vaccine_doses   *      *   *   *    
##  fb-survey             smoothed_restaurant_1d              *      *   *   *    
##  fb-survey             smoothed_restaurant_indoors_1d      *      *   *   *    
##  fb-survey             smoothed_screening_tested_positi... *      *   *   *    
##  fb-survey             smoothed_shop_1d                    *      *   *   *    
##  fb-survey             smoothed_shop_indoors_1d            *      *   *   *    
##  fb-survey             smoothed_spent_time_1d              *      *   *   *    
##  fb-survey             smoothed_spent_time_indoors_1d      *      *   *   *    
##  fb-survey             smoothed_tested_14d                 *      *   *   *    
##  fb-survey             smoothed_tested_positive_14d        *      *   *   *    
##  fb-survey             smoothed_travel_outside_state_5d    *      *   *   *    
##  fb-survey             smoothed_travel_outside_state_7d    *      *   *   *    
##  fb-survey             smoothed_vaccine_likely_doctors     *      *   *   *    
##  fb-survey             smoothed_vaccine_likely_friends     *      *   *   *    
##  fb-survey             smoothed_vaccine_likely_govt_health *      *   *   *    
##  fb-survey             smoothed_vaccine_likely_local_he... *      *   *   *    
##  fb-survey             smoothed_vaccine_likely_politicians *      *   *   *    
##  ght                   smoothed_search                            *   *   *    
##  google-survey         smoothed_cli                        *      *   *   *    
##  google-symptoms       ageusia_smoothed_search             *      *   *   *    
##  google-symptoms       anosmia_smoothed_search             *      *   *   *    
##  google-symptoms       sum_anosmia_ageusia_smoothed_search *      *   *   *    
##  hhs                   confirmed_admissions_1d                            *    
##  hhs                   confirmed_admissions_covid_1d                      *    
##  hhs                   sum_confirmed_suspected_admissio...                *    
##  hhs                   sum_confirmed_suspected_admissio...                *    
##  hospital-admissions   smoothed_adj_covid19                *      *   *   *    
##  hospital-admissions   smoothed_adj_covid19_from_claims    *      *   *   *    
##  hospital-admissions   smoothed_covid19                    *      *   *   *    
##  hospital-admissions   smoothed_covid19_from_claims        *      *   *   *    
##  indicator-combination confirmed_cumulative_num            *      *   *   *    
##  indicator-combination confirmed_cumulative_prop           *      *   *   *    
##  indicator-combination confirmed_incidence_num             *      *   *   *    
##  indicator-combination confirmed_incidence_prop            *      *   *   *    
##  indicator-combination deaths_cumulative_num               *      *   *   *    
##  indicator-combination deaths_cumulative_prop              *      *   *   *    
##  indicator-combination deaths_incidence_num                *      *   *   *    
##  indicator-combination deaths_incidence_prop               *      *   *   *    
##  indicator-combination nmf_day_doc_fbc_fbs_ght             *      *       *    
##  indicator-combination nmf_day_doc_fbs_ght                 *      *       *    
##  jhu-csse              confirmed_cumulative_num            *      *   *   *    
##  jhu-csse              confirmed_cumulative_prop           *      *   *   *    
##  jhu-csse              confirmed_incidence_num             *      *   *   *    
##  jhu-csse              confirmed_incidence_prop            *      *   *   *    
##  jhu-csse              deaths_cumulative_num               *      *   *   *    
##  jhu-csse              deaths_cumulative_prop              *      *   *   *    
##  jhu-csse              deaths_incidence_num                *      *   *   *    
##  jhu-csse              deaths_incidence_prop               *      *   *   *    
##  nchs-mortality        deaths_allcause_incidence_num                      *    
##  nchs-mortality        deaths_allcause_incidence_prop                     *    
##  nchs-mortality        deaths_covid_and_pneumonia_notfl...                *    
##  nchs-mortality        deaths_covid_and_pneumonia_notfl...                *    
##  nchs-mortality        deaths_covid_incidence_num                         *    
##  nchs-mortality        deaths_covid_incidence_prop                        *    
##  nchs-mortality        deaths_flu_incidence_num                           *    
##  nchs-mortality        deaths_flu_incidence_prop                          *    
##  nchs-mortality        deaths_percent_of_expected                         *    
##  nchs-mortality        deaths_pneumonia_notflu_incidenc...                *    
##  nchs-mortality        deaths_pneumonia_notflu_incidenc...                *    
##  nchs-mortality        deaths_pneumonia_or_flu_or_covid...                *    
##  nchs-mortality        deaths_pneumonia_or_flu_or_covid...                *    
##  quidel                covid_ag_smoothed_pct_positive      *      *   *   *    
##  quidel                smoothed_pct_negative                      *       *    
##  quidel                smoothed_tests_per_device                  *       *    
##  safegraph             bars_visit_num                      *      *   *   *    
##  safegraph             bars_visit_prop                     *      *   *   *    
##  safegraph             completely_home_prop                *      *   *   *    
##  safegraph             median_home_dwell_time              *      *   *   *    
##  safegraph             restaurants_visit_num               *      *   *   *    
##  safegraph             restaurants_visit_prop              *      *   *   *    
##  usa-facts             confirmed_cumulative_num            *      *   *   *    
##  usa-facts             confirmed_cumulative_prop           *      *   *   *    
##  usa-facts             confirmed_incidence_num             *      *   *   *    
##  usa-facts             confirmed_incidence_prop            *      *   *   *    
##  usa-facts             deaths_cumulative_num               *      *   *   *    
##  usa-facts             deaths_cumulative_prop              *      *   *   *    
##  usa-facts             deaths_incidence_num                *      *   *   *    
##  usa-facts             deaths_incidence_prop               *      *   *   *    
##  youtube-survey        smoothed_cli                                       *    
##  youtube-survey        smoothed_ili                                       *

Example: Deaths

How many COVID-19 deaths have been reported per day, in my state, since March 1, 2020?

library(ggplot2)  # provides scale_x_date() and theme() used below
start_day = "2020-03-01"
end_day = "2021-04-28"
deaths = covidcast_signal(data_source = "usa-facts", 
                          signal = "deaths_7dav_incidence_num", 
                          start_day = start_day, end_day = end_day,
                          geo_type = "state", geo_values = "pa")

plot(deaths, plot_type = "line", 
     title = "New COVID-19 deaths in PA (7-day average)") + 
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme(legend.position = "none")

Example: Hospitalizations

What percentage of daily hospital admissions are due to COVID-19 in PA, NY, and TX?

library(directlabels)  # provides geom_dl() for labeling lines directly
hosp = covidcast_signal(data_source = "hospital-admissions", 
                        signal = "smoothed_adj_covid19_from_claims",
                        start_day = start_day, end_day = end_day,
                        geo_type = "state", geo_values = c("pa", "ny", "tx"))

plot(hosp, plot_type = "line", 
     title = "% of hospital admissions due to COVID-19") + 
  geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)), 
          method = "last.bumpup") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme(legend.position = "none")

Example: Total Cases

What does the current COVID-19 cumulative case rate look like, nationwide?

cases = covidcast_signal(data_source = "usa-facts", 
                         signal = "confirmed_cumulative_prop",
                         start_day = end_day, end_day = end_day)

# end_day is a character string, so convert it to a Date before formatting
end_day_str = format(as.Date(end_day), "%B %d %Y")
plot(cases, title = "Cumulative COVID-19 cases per 100,000 people", 
     range = c(0, 12500), 
     choro_params = list(subtitle = end_day_str, legend_n = 6))

Example: Doctor’s Visits

How do some cities compare in terms of doctor’s visits due to COVID-like illness?

dv = covidcast_signal(data_source = "doctor-visits", 
                      signal = "smoothed_adj_cli", 
                      start_day = start_day, end_day = end_day,
                      geo_type = "msa", 
                      geo_values = name_to_cbsa(c("Miami", "New York", 
                                                  "Pittsburgh", "San Antonio")))

plot(dv, plot_type = "line", 
     title = "% of doctor's visits due to COVID-like illness") + 
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  scale_color_hue(labels = cbsa_to_name(unique(dv$geo_value)))

Example: Symptoms

How do my county and my friend’s county compare in terms of COVID symptoms?

sympt = covidcast_signal(data_source = "fb-survey", 
                         signal = "smoothed_hh_cmnty_cli",
                         start_day = "2020-04-15", end_day = end_day,
                         geo_values = c(name_to_fips("Allegheny"),
                                        name_to_fips("Fulton", 
                                                     state = "GA")))

plot(sympt, plot_type = "line", 
     title = "% of people who know somebody with COVID symptoms") + 
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  scale_color_hue(labels = fips_to_name(unique(sympt$geo_value)))

Example: Mask Use

How do some states compare in terms of self-reported mask usage?

states = c("dc", "ma", "ny", "wy", "sd", "id")
mask1 = covidcast_signal(data_source = "fb-survey", 
                         signal = "smoothed_wwearing_mask",
                         start_day = "2020-09-15", end_day = "2021-02-10",
                         geo_type = "state", geo_values = states)
mask2 = covidcast_signal(data_source = "fb-survey", 
                         signal = "smoothed_wwearing_mask_7d", 
                         start_day = "2021-02-11", end_day = end_day,
                         geo_type = "state", geo_values = states)
mask = rbind(mask1, mask2)

plot(mask, plot_type = "line", 
     title = "% of people who wear masks in public most/all the time") +
  geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)), 
          method = "last.bumpup") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme(legend.position = "none")

Example: Vaccines

How about vaccine uptake (self-reported), and willingness to take the vaccine (if not yet vaccinated)?

library(gridExtra)  # provides grid.arrange() for side-by-side plots
states = c("dc", "ma", "ny", "wy", "sd", "id")
vaccine = covidcast_signals(data_source = "fb-survey", 
                            signal = c("smoothed_wcovid_vaccinated",
                                       "smoothed_waccept_covid_vaccine"),
                            start_day = "2021-01-15", end_day = end_day,
                            geo_type = "state", geo_values = states)

g1 = plot(vaccine[[1]], plot_type = "line", 
     title = "% of people who have received COVID-19 vaccine, self-reported") +
  geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)), 
          method = "last.bumpup") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme(legend.position = "none")
g2 = plot(vaccine[[2]], plot_type = "line", 
     title = "% of people who would accept COVID-19 vaccine, if haven't yet") +
  geom_dl(aes(y = value, color = geo_value, label = toupper(geo_value)), 
          method = "last.bumpup") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  theme(legend.position = "none")
grid.arrange(g1, g2, nrow = 1)

As Of, Issues, Lag

By default the API returns the most recent data for each time_value. We also provide access to all previous versions of the data, using the following optional parameters:

Parameter  To get data …                                                   Examples
as_of      as if we queried the API on a particular date                   20200406
issues     published at a particular date or date range                    20200406 or 20200406-20200410
lag        published a certain number of time units after events occurred  1 or 3
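
To illustrate how these three parameters differ, here is a small base-R sketch on a made-up version history. It only mimics the selection logic described in the table; the helper names (`latest`, `as_of`, `issues`, `with_lag`) and all the numbers are hypothetical and are not part of the API or the packages.

```r
# Toy version history for one location: each row is one published version
# (issue) of the value for a given time_value. All numbers are made up.
versions = data.frame(
  time_value = as.Date(c("2020-04-06", "2020-04-06", "2020-04-06",
                         "2020-04-07", "2020-04-07")),
  issue      = as.Date(c("2020-04-07", "2020-04-08", "2020-04-10",
                         "2020-04-08", "2020-04-10")),
  value      = c(1.0, 1.4, 1.5, 2.0, 2.6))

# Default behavior: keep the most recent issue for each time_value
latest = function(df) {
  df = df[order(df$time_value, df$issue), ]
  df[!duplicated(df$time_value, fromLast = TRUE), ]
}

# as_of: what a query would have returned on a given date
as_of = function(df, date) latest(df[df$issue <= as.Date(date), ])

# issues: every version published within a date range
issues = function(df, from, to) {
  df[df$issue >= as.Date(from) & df$issue <= as.Date(to), ]
}

# lag: versions published exactly `lag` days after the events
with_lag = function(df, lag) {
  df[as.integer(df$issue - df$time_value) == lag, ]
}

print(latest(versions))                              # values 1.5 and 2.6
print(as_of(versions, "2020-04-08"))                 # values 1.4 and 2.0
print(issues(versions, "2020-04-08", "2020-04-08"))  # values 1.4 and 2.0
print(with_lag(versions, 1))                         # values 1.0 and 2.0
```

Note that `as_of` returns one row per time_value, while `issues` and `lag` can return several versions of the same time_value.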

Data Revisions

Why would we need this? Because many data sources are subject to revisions:

This presents a challenge to modelers: e.g., we have to learn how to forecast based on the data we’d have at the time, not updates that would arrive later. To accommodate, we log revisions even when the original data source does not!

Example: Backfill in Doctor’s Visits

The last two weeks of August in CA …

# Let's get the data that was available as of 09/21, for the end of August in CA
dv = covidcast_signal(data_source = "doctor-visits", 
                      signal = "smoothed_adj_cli",
                      start_day = "2020-08-15", end_day = "2020-08-31",
                      geo_type = "state", geo_values = "ca",
                      as_of = "2020-09-21")

# Plot the time series curve
xlim = c(as.Date("2020-08-15"), as.Date("2020-09-21"))
ylim = c(3.83, 5.92)
ggplot(dv, aes(x = time_value, y = value)) + 
  geom_line() +
  coord_cartesian(xlim = xlim, ylim = ylim) +
  geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
  labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
  theme_bw() + theme(legend.position = "bottom")

Example: Backfill in Doctor’s Visits (Cont.)

The last two weeks of August in CA …

# Now loop over a bunch of "as of" dates, fetch data from the API for each one
library(purrr)  # provides map_dfr()
as_ofs = seq(as.Date("2020-09-01"), as.Date("2020-09-21"), by = "week")
dv_as_of = map_dfr(as_ofs, function(as_of) {
  covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli",
                   start_day = "2020-08-15", end_day = "2020-08-31", 
                   geo_type = "state", geo_values = "ca", as_of = as_of)
})

# Plot the time series curve "as of" September 1
dv_as_of %>% 
  filter(issue == as.Date("2020-09-01")) %>% 
  ggplot(aes(x = time_value, y = value)) + 
  geom_line(aes(color = factor(issue))) + 
  coord_cartesian(xlim = xlim, ylim = ylim) +
  geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
  labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
  geom_line(data = dv, aes(x = time_value, y = value)) +
  geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
  theme_bw() + theme(legend.position = "none")

Example: Backfill in Doctor’s Visits (Cont.)

The last two weeks of August in CA …

dv_as_of %>% 
  ggplot(aes(x = time_value, y = value)) + 
  geom_line(aes(color = factor(issue))) + 
  coord_cartesian(xlim = xlim, ylim = ylim) +
  geom_vline(aes(color = factor(issue), xintercept = issue), lty = 2) +
  labs(color = "as of", x = "Date", y = "% doctor's visits due to CLI in CA") +
  geom_line(data = dv, aes(x = time_value, y = value)) +
  geom_vline(aes(xintercept = as.Date("2020-09-21")), lty = 2) +
  theme_bw() + theme(legend.position = "none")

Ongoing Pandemic Survey

Through a recruitment partnership with Facebook, we survey about 50,000 people daily in the US (over 20 million since the survey began in April 2020). Topics include:

This is the largest non-Census research survey ever conducted. Raw survey response data is available to researchers under a data use agreement. A parallel, international effort by the University of Maryland reaches 100+ countries in 55 languages

Part 2: Lessons Learned


Lessons and Reflections

An attempt to distill some lessons learned from the past year, related to statistical modeling and machine learning, broken down by three areas:

A: Forecasting in a Pandemic is Hard

The COVID-19 Forecast Hub collects short-term forecasts of incident COVID-19 cases, hospitalizations, and deaths. These are made by 50+ groups of “citizen scientists”, and power the CDC’s official communications on COVID-19 forecasting


This is not an easy problem:

(All of this—plus an additional model-level nonstationarity—carries over to building an ensemble model!)

A: Forecasting in a Pandemic is Hard (Cont.)

Only a small handful of models consistently outperform the baseline (essentially the flat-line forecaster). For example, from Cramer et al. (2021):
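
For readers unfamiliar with the baseline, here is a rough base-R sketch of a flat-line point forecaster and of scoring another model relative to it. This is a simplification under stated assumptions: it uses point forecasts and mean absolute error only, the "trend" model is a hypothetical stand-in, and all numbers are made up rather than taken from Cramer et al. (2021).

```r
# Flat-line point forecaster: carry the last observed value forward for every
# horizon. (Simplification: point forecasts only, no predictive intervals.)
flat_line_forecast = function(history, horizons) {
  rep(history[length(history)], length(horizons))
}

# Mean absolute error of a set of point forecasts
mae = function(forecast, actual) mean(abs(forecast - actual))

# Made-up weekly counts, purely for illustration
history = c(100, 120, 150, 170)   # observed up to the forecast date
actual  = c(180, 200, 190, 160)   # what later happened, 1 to 4 weeks ahead

baseline = flat_line_forecast(history, 1:4)

# Hypothetical competitor: extend the most recent week-over-week change
trend = history[4] + (history[4] - history[3]) * (1:4)

# Relative MAE below 1 means the model beats the baseline; above 1, it loses
rel_mae = mae(trend, actual) / mae(baseline, actual)
print(rel_mae)
```

In this toy setup the trend-following model does worse than the flat line once the curve turns, which is the kind of behavior that makes the baseline hard to beat consistently.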


A: Forecasting in a Pandemic is Hard (Cont.)

Lessons/reflections:

B. Nowcasting is Dark Horse Candidate for MVP

Nowcasting: estimating the value of a signal that will only be fully observed at a later date. Current data is partial/noisy, but progressively improves as time passes.


Example: suppose we want to use medical insurance claims to estimate how many people have some disease on some day (in some location)
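
One simple way to picture the claims example is to scale up partial counts by how complete they are expected to be. The base-R sketch below assumes (hypothetically) a stable reporting-delay pattern; the completeness fractions and counts are made up, and this is an illustration of the general idea rather than the method used in the talk.

```r
# Made-up example: claims for a given service day keep arriving for several
# days. Suppose that, historically, this fraction of the final count had been
# received k days after the service day (k = 0, 1, 2, 3, 4 or more).
completeness = c(0.30, 0.55, 0.80, 0.95, 1.00)

# Counts received so far for the five most recent days (oldest first), so the
# oldest day has lag 4 and the most recent day has lag 0
partial_counts = c(400, 380, 336, 242, 138)
lags = 4:0

# Nowcast: divide each partial count by its expected completeness
# (lags + 1 converts a 0-based lag into a 1-based R index)
nowcast = partial_counts / completeness[lags + 1]
print(round(nowcast))
```

The most recent days get the largest correction and are therefore the most uncertain, since a small error in the assumed completeness fraction is amplified the most there.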

B. Nowcasting is Dark Horse Candidate for MVP (Cont.)

Meanwhile, in COVID-19, it’s even more complicated:

Even settling for the penultimate bullet, we would be nowcasting a latent variable (never observed)


B. Nowcasting is Dark Horse Candidate for MVP (Cont.)

Lessons/reflections:

While time scales may change, nowcasting is not going away as a central problem in public health …

C. Taking Risks (When You Can Afford To Do So)

The beginning of the pandemic created a clear pull for computational scientists: fetch case and death data from JHU CSSE’s GitHub, learn about SIR modeling, inject stochasticity, start making forecasts

We decided early on to swim against the stream. It’s not that this work wasn’t important, but rather, we felt we could create greater value by working on the data problem (to hopefully benefit many others)

We wouldn’t/couldn’t have taken this risk if there weren’t so many strong computational scientists who jumped into work on forecasting

C. Taking Risks (When You Can Afford To Do So) (Cont.)

It can be hard to quantify the value of good data. We will be trying to do this for years to come (not just us/our data … this is an important undertaking for the whole scientific community)

That said, we are starting to see (in retrospect) some encouraging results in problems where you can quantify value, like forecasting and nowcasting


Thanks

For more, visit https://covidcast.cmu.edu (you’ll find everything linked from there)